Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

Abstract

Despite impressive vision-language (VL) pretraining with BERT-based encoders for VL understanding, the pretraining of a universal encoder-decoder for both VL understanding and generation remains challenging. The difficulty originates from the inherently different peculiarities of the two disciplines, e.g., VL understanding tasks capitalize on unrestricted message passing across modalities, while generation tasks only employ visual-to-textual message passing. In this paper, we start with a two-stream decoupled design of the encoder-decoder structure, in which a decoupled cross-modal encoder and decoder are involved to separately perform each type of proxy tasks, for simultaneous VL understanding and generation pretraining. Moreover, for VL pretraining, the dominant way is to replace some input visual/word tokens with mask tokens and enforce the multi-modal encoder/decoder to reconstruct the original tokens, but no mask token is involved when fine-tuning on downstream tasks. As an alternative, we propose a primary scheduled sampling strategy that elegantly mitigates such discrepancy via pretraining the encoder-decoder in a two-pass manner. Extensive experiments demonstrate the compelling generalizability of our pretrained encoder-decoder by fine-tuning on four VL understanding and generation downstream tasks. Source code is available at https://github.com/YehLi/TDEN.

Similar resources

An Encoder-Decoder Framework Translating Natural Language to Database Queries

Machine translation is going through a radical revolution, driven by the explosive development of deep learning techniques using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In this paper, we consider a special case of machine translation: translating natural language into Structured Query Language (SQL) for data retrieval over relational databases. ...

Single Image Reflection Removal Using Deep Encoder-Decoder Network

Image of a scene captured through a piece of transparent and reflective material, such as glass, is often spoiled by a superimposed layer of reflection image. While separating the reflection from a familiar object in an image is mentally not difficult for humans, it is a challenging, ill-posed problem in computer vision. In this paper, we propose a novel deep convolutional encoder-decoder metho...

A Recurrent Encoder-Decoder Network for Sequential Face Alignment

We propose a novel recurrent encoder-decoder network model for real-time video-based face alignment. Our proposed model predicts 2D facial point maps regularized by a regression loss, while uniquely exploiting recurrent learning at both spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, in order to ena...

Japanese Text Normalization with Encoder-Decoder Model

Text normalization is the task of transforming lexical variants to their canonical forms. We model the problem of text normalization as a character-level sequence-to-sequence learning problem and present a neural encoder-decoder model for solving it. To train the encoder-decoder model, many sentence pairs are generally required. However, Japanese non-standard canonical pairs are scarce in the ...

Molecular all-photonic encoder-decoder.

In data processing, an encoder can compress digital information for transmission or storage, whereas a decoder recovers the information in its original form. We report a molecular triad consisting of a dithienylethene covalently linked to two fulgimide photochromes that performs as an all-photonic single-bit 4-to-2 encoder and 2-to-4 decoder. The encoder compresses the information contained in ...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i10.17034